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Introduction 

Overview of the use of a matrix times a vector for the description of signal 
and systems operations. The vectors are descriptions of the signals and the 
matrix operator is a description of the system. 


Introduction 


The tools, ideas, and insights from linear algebra, abstract algebra, and 
functional analysis can be extremely useful to signal processing and system 
theory in various areas of engineering, science, and social science. Indeed, 
many important ideas can be developed from the simple operator equation 
Equation: 


Ax=b 


by considering it in a variety of ways. If x and b are vectors from the same 
or, perhaps, different vector spaces and A is an operator, there are three 
interesting questions that can be asked which provide a setting for a broad 
study. 


1. Given A and x , find b . The analysis or operator problem or 
transform. 

2. Given A and b, find x . The inverse or control problem or 
deconvolution or design. 

3. Given x and b, find A . The synthesis or design problem or parameter 
identification. 


Much can be learned by studying each of these problems in some detail. We 
will generally look at the finite dimensional problem where [link] can more 
easily be studied as a finite matrix multiplication [link], [link], [link], [link] 
Equation: 


Qi1 a12 413 °°: GIN L4 by 
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but will also try to indicate what the infinite dimensional case might be 
[ink], [Link], [link], [Link]. 


An application to signal theory is in [link], to optimization [link], and 
multiscale system theory [link]. The inverse problem (number 2 above) is 
the basis for a large study of pseudoinverses, approximation, optimization, 
filter design, and many applications. When used with the /2 norm [link], 
[link] powerful results can be optained analytically but used with other 
norms such as |, 1, Jp (a pseudonorm), an even larger set of problems can 
be posed and solved [link], [link]. 


A development of vector space ideas for the purpose of presenting wavelet 
representations is given in [link], [link]. An interesting idea of 
unconditional bases is given by Donoho [link]. 


Linear regression analysis can be posed in the form of [link] and [link] 
where the M rows of A are the vectors of input data from M experiments, 
entries of x are the N weights for the N components of the inputs, and the 
M values of b are the outputs [link]. This can be used in machine learning 
problems [link], [link]. A problem similar to the design or synthesis 
problem is that of parameter identification where a model of some system is 
posed with unknown parameters. Then experiments with known inputs and 
measured outputs are run to identify these parameters. Linear regression is 
also an example of this [link], [link]. 


Dynamic systems are often modelled by ordinary differential equation 
where b is set to be the time derivative of x to give what are called the 
linear state equations: 

Equation: 


x = Ax 


or for difference equations and discrete-time or digital signals, 
Equation: 


x(n +1) = Ax(n) 


which are used in digital signal processing and the analysis of certain 
algorithms. State equations are useful in feedback control as well as in 
simulation of many dynamical systems and the eigenvalues and other 
properties of the square matix A are important indicators of the 
performance [link], [link]. 


The ideas of similarity transformations, diagonalization, the eigenvalue 
problem, Jordon normal form, singular value decomposition, etc. from 
linear algebra [link], [link], [link] are applicable to this problem. 


Various areas in optimization and approximation use vector space math to 
great advantage [link], [link]. 


This booklet is intended to point out relationships, interpretations, and tools 
in linear algebra, matrix theory, and vector spaces that scientists and 
engineers might find useful. It is not a stand-alone linear algebra book. 
Details, definitions, and formal proofs can be found in the references. A 
very helpful source is Wikipedia. 


There is a variety software systems to both pose and solve linear algebra 
problems. A particularly powerful one is Matlab [link] which is, in some 
ways, the gold standard since it started years ago a purely numerical matrix 
package. But there are others such as Octave, SciLab, LabVIEW, 
Mathematica, Maple, etc. 


A Matrix Times a Vector 

One can look at the operation of a matrix times a vector as changing the basis set for 
the vector or as changing the vector with the same basis description. Many signal 
and systems problem can be posed in this form. 


A Matrix Times a Vector 


In this chapter we consider the first problem posed in the introduction 
Equation: 


Ax=b 


where the matrix A and vector x are given and we want to interpret and give 
structure to the calculation of the vector b . Equation [link] has a variety of special 
cases. The matrix A may be square or may be rectangular. It may have full column 
or row rank or it may not. It may be symmetric or orthogonal or non-singular or 
many other characteristics which would be interesting properties as an operator. If 
we view the vectors as signals and the matrix as an operator or processor, there are 
two interesting interpretations. 


e The operation [link] is a change of basis or coordinates for a fixed signal. The 
signal stays the same, the basis (or frame) changes. 

e The operation [link] alters the characteristics of the signal (processes it) but 
within a fixed basis system. The basis stays the same, the signal changes. 


An example of the first would be the discrete Fourier transform (DFT) where one 
calculates frequency components of a signal which are coordinates in a frequency 
space for a given signal. The definition of the DFT from [link] can be written as a 
matrix-vector operation by ce = Wx which, for w = e /?"/" and N = 4, is 
Equation: 


Co www w Lo 
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An example of the second might be convolution where you are processing or 
filtering a signal and staying in the same space or coordinate system. 
Equation: 


Y¥0 ho 0 0 --- O LO 
V1 hy ho 0 1 
yo he hy ho Lo 


A particularly powerful sequence of operations is to first change the basis for a 
signal, then process the signal in this new basis, and finally return to the original 
basis. For example, the discrete Fourier transform (DFT) of a signal is taken 
followed by setting some of the Fourier coefficients to zero followed by taking the 
inverse DFT. 


Another application of [link] is made in linear regression where the input signals are 
rows of A and the unknown weights of the hypothesis are in x and the outputs are 
the elements of b. 


Change of Basis 
Consider the two views: 


1. The operation given in [link] can be viewed as x being a set of weights so that 
b is a weighted sum of the columns of A . In other words, b will lie in the 
space spanned by the columns of A at a location determined by x . This view 
is a composition of a signal from a set of weights as in [link] and [link] below. 
If the vector a; is the i*” column of A, it is illustrated by 
Equation: 


Ax=2@, a, +2 ag +23 a3 =b. 


2. An alternative view has x being a signal vector and with b being a vector 
whose entries are inner products of x and the rows of A. In other words, the 
elements of b are the projection coefficients of x onto the coordinates given by 
the rows of A. The multiplication of a signal by this operator decomposes the 
signal and gives the coefficients of the decomposition. If a; is the 7 row of A 
we have: 

Equation: 


by = [---azy---| xX by = |[---ag---| x etc. 


Regression can be posed from this view with the input signal being the rows of 
A. 


These two views of the operation as a decomposition of a signal or the 
recomposition of the signal to or from a different basis system are extremely 
valuable in signal analysis. The ideas from linear algebra of subspaces, inner 
product, span, orthogonality, rank, etc. are all important here. The dimensions of the 
domain and range of the operators may or may not be the same. The matrices may 
or may not be square and may or may not be of full rank [link], [link]. 


A Basis and Dual Basis 


A set of linearly independent vectors x, forms a basis for a vector space if every 
vector x in the space can be uniquely written 


Equation: 
x= S> Qn Xn 
n 


and the dual basis is defined as a set vectors Xy in that space allows a simple inner 
product (denoted by parenthesis: (x, y)) to calculate the expansion coefficients as 
Equation: 


A basis expansion has enough vectors but none extra. It is efficient in that no fewer 
expansion vectors will represent all the vectors in the space but is fragil in that 
losing one coefficient or one basis vector destroys the ability to exactly represent the 
signal by [link]. The expansion [link] can be written as a matrix operation 
Equation: 


Fa=x 


where the columns of F are the basis vectors X, and the vector a has the expansion 
coefficients a, as entries. Equation [link] can also be written as a matrix operation 
Equation: 


Fx=a 


which has the dual basis vectors as rows of F. From [link] and [link], we have 
Equation: 


FFx =x 
Since this is true for all x, 
Equation: 
FF =I 
or 
Equation: 
F=F? 


which states the dual basis vectors are the rows of the inverse of the matrix whose 
columns are the basis vectors (and vice versa). When the vector set is a basis, F is 
necessarily square and from [link] and [link], one can show 

Equation: 


Because this system requires two basis sets, the expansion basis and the dual basis, 
it is called biorthogonal. 


Orthogonal Basis 


If the basis vectors are not only independent but orthonormal, the basis set is its own 
dual and the inverse of F is simply its transpose. 
Equation: 


Foi-Pp-F* 


When done in Hilbert spaces, this decomposition is sometimes called an abstract 
Fourier expansion [link], [link], [Link]. 


Parseval's Theorem 


Because many signals are digital representations of voltage, current, force, velocity, 
pressure, flow, etc., the inner product of the signal with itself (the norm squared) is a 
measure of the signal energy q. 

Equation: 


| 
8 


q = (x, x) = ||x||? = x"x : 


Parseval's theorem states that if the basis system is orthogonal, then the norm 
squared (or “energy”) is invarient across a change in basis. If a change of basis is 
made with 


Equation: 
c= Ax 
then 
Equation: 
N-1 N-1 
q = (x,x) = |[x||? =x7x = Sa? = K(e,¢) = K\le||? = KeTe = KJ 
n=0 k=0 


for some constant K which can be made unity by normalization if desired. 


For the discrete Fourier transform (DFT) of z,, which is 
Equation: 


1 ; 
—j2ank/N 
Ck= eae 


n=0 


the energy calculated in the time domain: g = >, x2 is equal to the norm squared 
of the frequency coefficients: gq = >>, a, within a multiplicative constant of 1/N. 
This is because the basis functions of the Fourier transform are orthogonal: “the sum 
of the squares is the square of the sum” which means means the energy calculated in 
the time domain is the same as that calculated in the frequency domain. The energy 
of the signal (the square of the sum) is the sum of the energies at each frequency 
(the sum of the squares). Because of the orthogonal basis, the cross terms are zero. 
Although one seldom directly uses Parseval's theorem, its truth is what make sense 
in talking about frequency domain filtering of a time domain signal. A more general 
form is known as Plancherel theorem [link]. 


If a transformation is made on the signal with a non-orthogonal basis system, then 
Parseval's theorem does not hold and the concept of energy does not move back and 
forth between domains. We can get around some of these restrictions by using 
frames rather than bases. 


Frames and Tight Frames 


In order to look at a more general expansion system than a basis and to generalize 
the ideas of orthogonality and of energy being calculated in the original expansion 
system or the transformed system, the concept of frame is defined. A frame 
decomposition or representation is generally more robust and flexible than a basis 
decomposition or representation but it requires more computation and memory 
[link], [link], [link]. Sometimes a frame is called a redundant basis or representing 
an underdetermined or underspecified set of equations. 


If a set of vectors, f,, span a vector space (or subspace) but are not necessarily 
independent nor orthogonal, bounds on the energy in the transform can still be 
defined. A set of vectors that span a vector space is called a frame if two constants, 
A and B exist such that 

Equation: 


2 
0< Allx||" < $7 (fk,x)  < B\|x||’ < 00 
k 


and the two constants are called the frame bounds for the system. This can be 
written 


Equation: 


0 < Al|x||? < |lel|? < Bllx||? < 00 


where 
Equation: 


c — Fx 


If the f, are linearly independent but not orthogonal, then the frame is a non- 
orthogonal basis. If the f, are not independent the frame is called redundant since 
there are more than the minimum number of expansion vectors that a basis would 
have. If the frame bounds are equal, A = B, the system is called a tight frame and it 
has many of features of an orthogonal basis. If the bounds are equal to each other 
and to one, A = B = 1, then the frame is a basis and is tight. It is, therefore, an 
orthogonal basis. 


So a frame is a generalization of a basis and a tight frame is a generalization of an 
orthogonal basis. If , A = B, the frame is tight and we have a scaled Parseval's 
theorem: 

Equation: 


Al|x||? =) 7 (fe, x)|" 


k 


If A = B > 1, then the number of expansion vectors are more than needed for a 
basis and A is a measure of the redundancy of the system (for normalized frame 
vectors). For example, if there are three frame vectors in a two dimensional vector 
space, A = 3/2. 


A finite dimensional matrix version of the redundant case would have F in [link] 
with more columns than rows but with full row rank. For example 


Equation: 
fae ao1 | bo 
Ly — 
19 O11 G12] by 


has three frame vectors as the columns of A but in a two dimensional space. 


The prototypical example is called the Mercedes-Benz tight frame where three 
frame vectors that are 120° apart are used in a two-dimensional plane and look like 
the Mercedes car hood ornament. These three frame vectors must be as far apart 
from each other as possible to be tight, hence the 120° separation. But, they can be 
rotated any amount and remain tight [link], [link] and, therefore, are not unique. 
Equation: 


LO 


1 -0.5 0.5 [bo 
0 0.866 —0.866/ “' ~ |p, 
LQ 


In the next section, we will use the pseudo-inverse of A to find the optimal x for a 
given b. 


So the frame bounds A and B in [link] are an indication of the redundancy of the 
expansion system f; and to how close they are to being orthogonal or tight. Indeed, 
[link] is a sort of approximate Parseval's theorem [link], [link], [link], [link], [link], 
[link], [link], [Link]. 


The dual frame vectors are also not unique but a set can be found such that [link] 
and, therefore, [link] hold (but [link] does not). A set of dual frame vectors could be 
found by adding a set of arbitrary but independent rows to F until it is square, 
inverting it, then taking the first N columns to form F whose rows will be a set of 
dual frame vectors. This method of construction shows the non-uniqueness of the 
dual frame vectors. This non-uniqueness is often resolved by optimizing some other 
parameter of the system [link]. 


If the matrix operations are implementing a frame decomposition and the rows of F 
are orthonormal, then F = FT and the vector set is a tight frame [link], [link]. If the 
frame vectors are normalized to ||x, || = 1, the decomposition in [link] becomes 
Equation: 


c= 5 Fn) >a 


where the constant A is a measure of the redundancy of the expansion which has 
more expansion vectors than necessary [link]. 


The matrix form is 
Equation: 


where F has more columns than rows. Examples can be found in [link]. 


Sinc Expansion as a Tight Frame 


The Shannon sampling theorem [link] can be viewied as an infinite dimensional 
signal expansion where the sinc functions are an orthogonal basis. The sampling 
theorem with critical sampling, i.e. at the Nyquist rate, is the expansion: 
Equation: 


g(t) = >> g(Tn) 


where the expansion coefficients are the samples and where the sinc functions are 
easily shown to be orthogonal. 


Over sampling is an example of an infinite-dimensional tight frame [link], [link]. If 
a function is over-sampled but the sinc functions remains consistent with the upper 
spectral limit W, using A as the amount of over-sampling, the sampling theorem 
becomes: 

Equation: 


AW = a for A>1 


and we have 
Equation: 


where the sinc functions are no longer orthogonal. In fact, they are no longer a basis 
as they are not independent. They are, however, a tight frame and, therefore, have 
some of the characteristics of an orthogonal basis but with a “redundancy" factor A 
as a multiplier in the formula [link] and a generalized Parseval's theorem. Here, 
moving from a basis to a frame (actually from an orthogonal basis to a tight frame) 
is almost invisible. 


Frequency Response of an FIR Digital Filter 


The discrete-time Fourier transform (DTFT) of the impulse response of an FIR 
digital filter h(n) is its frequency response. The discrete Fourier transform (DFT) of 
h(n) gives samples of the frequency response [link]. This is a powerful analysis 
tool in digital signal processing (DSP) and suggests that an inverse (or 
pseudoinverse) method could be useful for design [link]. 


Conclusions 


Frames tend to be more robust than bases in tolerating errors and missing terms. 
They allow flexibility is designing wavelet systems [link] where frame expansions 
are often chosen. 


In an infinite dimensional vector space, if basis vectors are chosen such that all 
expansions converge very rapidly, the basis is called an unconditional basis and is 
near optimal for a wide class of signal representation and processing problems. This 
is discussed by Donoho in [link]. 


Still another view of a matrix operator being a change of basis can be developed 
using the eigenvectors of an operator as the basis vectors. Then a signal can 
decomposed into its eigenvector components which are then simply multiplied by 
the scalar eigenvalues to accomplish the same task as a general matrix 
multiplication. This is an interesting idea but will not be developed here. 


Change of Signal 


If both x and b in [link] are considered to be signals in the same coordinate or basis 
system, the matrix operator A is generally square. It may or may not be of full rank 
and it may or may not have a variety of other properties, but both x and b are 
viewed in the same coordinate system and therefore are the same size. 


One of the most ubiquitous of these is convolution where the input to a linear, shift 
invariant system with impulse response h(n) is calculated by [link] if A is the 
convolution matrix and x is the input [link]. 


Equation: 
Y¥0 ho 0 0 0 4) 
V1 hi ho 0 LI 
y2 he hy ho LQ 


It can also be calculated if A is the arrangement of the input and x is the the 
impulse response. 


Equation: 
¥0 LO 0 0 0 ho 
V1 Z1 Xo 0 hy 
y2 &2 L1 Xo ho 


If the signal is periodic or if the DFT is being used, then what is called a circulate is 
used to represent cyclic convolution. An example for NV = 4 is the Toeplitz system 
Equation: 


Yo ho hg he hi Zo 
Y1 hy ho hg he 2 
Yy2 hg hy ho hg 22 
Yy3 hg ho hi ho 23 


One method of understanding and generating matrices of this sort is to construct 
them as a product of first a decomposition operator, then a modification operator in 
the new basis system, followed by a recomposition operator. For example, one could 
first multiply a signal by the DFT operator which will change it into the frequency 
domain. One (or more) of the frequency coefficients could be removed (set to zero) 
and the remainder multiplied by the inverse DFT operator to give a signal back in 
the time domain but changed by having a frequency component removed. That is a 


form of signal filtering and one can talk about removing the energy of a signal at a 
certain frequency (or many) because of Parseval's theorem. 


It would be instructive for the reader to make sense out of the cryptic statement “the 
DFT diagonalizes the cyclic convolution matrix" to add to the ideas in this note. 


Factoring the Matrix A 


For insight, algorithm development, and/or computational efficiency, it is sometime 
worthwhile to factor A into a product of two or more matrices. For example, the 
DFT matrix [link] illustrated in [link] can be factored into a product of fairly 
sparce matrices. If fact, the fast Fourier transform (FFT) can be derived by factoring 
the DFT matrix into N log (JV) factors (if N = 2), each requiring order N 
multiplies. This is done in [link]. 


Using eigenvalue theory [link], a full rank square matrix can be factored into a 
product 
Equation: 


AV=VA 


where V is a matrix with columns of the eigenvectors of A and A is a diagonal 
matrix with the eigenvalues along the diagonal. The inverse is a method to 
“diagonalize" a matrix 

Equation: 


A=V'AV 


If a matrix has “repeated eigenvalues", in other words, two or more of the V 
eigenvalues have the same value but less than N indepentant eigenvectors, it is not 
possible to diagonalize the matrix but an “almost” diagonal form called the Jordan 
normal form can be acheived. Those details can be found in most books on matrix 
theory [link]. 


A more general decompostion is the singular value decomposition (SVD) which is 
similar to the eigenvalue problem but allows rectangular matrices. It is particularly 
valuable for expressing the pseudoinverse in a simple form and in making numerical 
calculations [link]. 


State Equations 


If our matrix multiplication equation is a vector differential equation (DE) of the 
form 
Equation: 


x= Ax 


or for difference equations and discrete-time signals or digital signals, 
Equation: 


x(n +1) = Ax(n) 


an inverse or even pseudoinverse will not solve for x . A different approach must be 
taken [link] and different properties and tools from linear algebra will be used. The 
solution of this first order vector DE is a coupled set of solutions of first order DEs. 
If a change of basis is made so that A is diagonal (or Jordan form), equation [link] 
becomes a set on uncoupled (or almost uncoupled in the Jordan form case) first 
order DEs and we know the solution of a first order DE is an exponential. This 
requires consideration of the eigenvalue problem, diagonalization, and solution of 
scalar first order DEs [link]. 


State equations are often used to model or describe a system such as a control 
system or a digital filter or a numerical algorithm [link], [link]. 


General Solutions of Simultaneous Equations 


The second problem posed in the introduction is basically the solution of 
simultaneous linear equations [link], [link], [link] which is fundamental to 
linear algebra [link], [link], [link] and very important in diverse areas of 
applications in mathematics, numerical analysis, physical and social sciences, 
engineering, and business. Since a system of linear equations may be over or 
under determined in a variety of ways, or may be consistent but ill 
conditioned, a comprehensive theory turns out to be more complicated than it 
first appears. Indeed, there is a considerable literature on the subject of 
generalized inverses or pseudo-inverses. The careful statement and 
formulation of the general problem seems to have started with Moore [link] 
and Penrose [link], [link] and developed by many others. Because the 
generalized solution of simultaneous equations is often defined in terms of 
minimization of an equation error, the techniques are useful in a wide variety 
of approximation and optimization problems [link], [link] as well as signal 
processing. 


The ideas are presented here in terms of finite dimensions using matrices. 
Many of the ideas extend to infinite dimensions using Banach and Hilbert 
spaces [link], [link], [link] in functional analysis. 


The Problem 


Given an M by N real matrix A and an M by 1 vector b, find the NV by 1 
vector x when 


Equation: 
Qi1 G12 Qi3 -*: Gin | | 21 by 
Q21 22 Q23 L2 by 
G31 432 33 x3 | — | 3 
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or, using matrix notation, 
Equation: 


Ax=b 


If b does not lie in the range space of A (the space spanned by the columns of 
A), there is no exact solution to [link], therefore, an approximation problem 
can be posed by minimizing an equation error defined by 

Equation: 


e— Ax—b. 


A generalized solution (or an optimal approximate solution) to [link] is 
usually considered to be an x that minimizes some norm of ¢. If that problem 
does not have a unique solution, further conditions, such as also minimizing 
the norm of x, are imposed. The /2 or root-mean-squared error or Euclidean 


norm is VeT’e and minimization sometimes has an analytical solution. 
Minimization of other norms such as /,, (Chebyshev) or J; require iterative 
solutions. The general /,, norm is defined as q where 

Equation: 


a= |lel|p = (s: e or ’ 


for 1 < p < o anda “pseudonorm" (not convex) for 0 < p < 1. These can 
sometimes be evaluated using IRLS (iterative reweighted least squares) 
algorithms [link], [link], [link], [link], [link]. 


If there is a non-zero solution of the homogeneous equation 
Equation: 


Ax = 0, 


then [link] has infinitely many generalized solutions in the sense that any 
particular solution of [link] plus an arbitrary scalar times any non-zero 
solution of [link] will have the same error in [link] and, therefore, is also a 
generalized solution. The number of families of solutions is the dimension of 
the null space of A. 


This is analogous to the classical solution of linear, constant coefficient 
differential equations where the total solution consists of a particular solution 
plus arbitrary constants times the solutions to the homogeneous equation. The 
constants are determined from the initial (or other) conditions of the solution 
to the differential equation. 


Ten Cases to Consider 


Examination of the basic problem shows there are ten cases [link] listed in 
Figure 1 to be considered. These depend on the shape of the / by N real 
matrix A, the rank r of A, and whether b is in the span of the columns of A. 


la. M = N =r: One solution with no error, €. 

1b. MM =N>r:b € span{A}: Many solutions with ¢ = 0. 

1c. M =N>r:bnot € span{A}: Many solutions with the same 
minimum error. 

2a.M > N=r:b € span{A}: One solution e = 0. 

2b. M > N=r:bnot € span{A}: One solution with minimum error. 
2c.M>N>r:b € span{A}: Many solutions with e = 0. 
2d.M > N>r:bnot € span{A}: Many solutions with the same 
minimum etror. 

3a. N > M =r: Many solutions with ¢ = 0. 

3b. N > M>r:b © span{A}: Many solutions with e = 0 
3c.N > M >r:bnot € span{A}: Many solutions with the same 
minimum error. 


Figure 1. Ten Cases for the Pseudoinverse. 


Here we have: 


case 1 has the same number of equations as unknowns (A is square, 

M — N), 

case 2 has more equations than unknowns, therefore, is over specified (A 
is taller than wide, M > N), 

case 3 has fewer equations than unknowns, therefore, is underspecified 
(A is wider than tall N > M). 


This is a setting for frames and sparse representations. 


In case 1a and 3a, b is necessarily in the span of A. In addition to these 
classifications, the possible orthogonality of the columns or rows of the 
matrices gives special characteristics. 


Examples 


Case 1: Here we see a 3 x 3 square matrix which is an example of case 1 in 
Figure 1 and 2. 


Equation: 
Q11 G12 G13| | 21 by 
G21 G22 G23] | 22} = | be 
Q31 a32 @33| | 23 b3 


If the matrix has rank 3, then the b vector will necessarily be in the space 
spanned by the columns of A which puts it in case 1a. This can be solved for 
x by inverting A or using some more robust method. If the matrix has rank 1 
or 2, the b may or may not lie in the spanned subspace, so the classification 


will be 1b or 1c and minimization of ||:||> yields a unique solution. 


Case 2: If A is 4 x 3, then we have more equations than unknowns or the 
overspecified or overdetermined case. 


Equation: 

Q11 Q12 413 by 
Ly 

A21 422 Q23 bo 
“9 = 

G31 432 33 b3 
£3 

Q41 G42 G43 b4 


If this matrix has the maximum rank of 3, then we have case 2a or 2b 
depending on whether b is in the span of A or not. In either case, a unique 
solution x exists which can be found by [link] or [link]. For case 2a, we have 
a single exact solution with no equation error, € = O just as case 1a. For case 
2b, we have a single optimal approximate solution with the least possible 


equation error. If the matrix has rank 1 or 2, the classification will be 2c or 2d 
and minimization of ||z||3 yelds a unique solution. 


Case 3: If A is 3 x 4, then we have more unknowns than equations or the 
underspecified case. 


Equation: 

LI 

Q4i 432 413 14 by 
LQ 

Q21 422 423 G24 = |bo 
£3 

Q31 432 433 434 bs 
L4 


If this matrix has the maximum rank of 3, then we have case 3a and b must be 
in the span of A . For this case, many exact solutions x exist, all having zero 
equation error and a single one can be found with minimum solution norm 
||x|| using [link] or [link]. If the matrix has rank 1 or 2, the classification will 
be 3b or 3c. 


Solutions 


There are several assumptions or side conditions that could be used in order to 
define a useful unique solution of [link]. The side conditions used to define 
the Moore-Penrose pseudo-inverse are that the 2 norm squared of the 
equation error € be minimized and, if there is ambiguity (several solutions 
with the same minimum error), the /2 norm squared of x also be minimized. 
A useful alternative to minimizing the norm of x is to require certain entries 
in x to be zero (sparse) or fixed to some non-zero value (equality constraints). 


In using sparsity in posing a signal processing problem (e.g. compressive 
sensing), an /; norm can be used (or even an /g “pseudo norm”) to obtain 
solutions with zero components if possible [link], [link]. 


In addition to using side conditions to achieve a unique solution, side 
conditions are sometimes part of the original problem. One interesting case 
requires that certain of the equations be satisfied with no error and the 
approximation be achieved with the remaining equations. 


Moore-Penrose Pseudo-Inverse 


If the Jj norm is used, a unique generalized solution to [link] always exists 
such that the norm squared of the equation error e7’e and the norm squared 
of the solution x? x are both minimized. This solution is denoted by 
Equation: 


x=Atb 
where A™ is called the Moore-Penrose inverse [link] of A (and is also called 
the generalized inverse [link] and the pseudoinverse [link]) 


Roger Penrose [link] showed that for all A, there exists a unique At 
satisfying the four conditions: 


Equation: 
AATA=A 
Equation: 
ATAA*=A™ 
Equation: 
[AA*] =AA* 
Equation: 


[A*A] =ATA 


There is a large literature on this problem. Five useful books are [link], [link], 
[link], [link], [link]. The Moore-Penrose pseudo-inverse can be calculated in 
Matlab [link] by the pinv(A, tol) function which uses a singular value 
decomposition (SVD) to calculate it. There are a variety of other numerical 
methods given in the above references where each has some advantages and 
some disadvantages. 


Properties 


For cases 2a and 2b in Figure 1, the following N by N system of equations 
called the normal equations{link], [link] have a unique minimum squared 
equation error solution (minimum €7 €). Here we have the over specified case 
with more equations than unknowns. A derivation is outlined in "Derivations", 
equation [link] below. 

Equation: 


AT Ax—A'™ b 


The solution to this equation is often used in least squares approximation 
problems. For these two cases A? A is non-singular and the N by M pseudo- 
inverse is simply, 

Equation: 


* —1 * 
At=[a™al a™. 
A more general problem can be solved by minimizing the weighted equation 
error, €£ W1 We where W is a positive semi-definite diagonal matrix of the 


error weights. The solution to that problem [link] is 
Equation: 


* * —1 * * 
At=|ATwTwal AT ww. 
For the case 3a in Figure 1 with more unknowns than equations, AA? is non- 
singular and has a unique minimum norm solution, ||x||. The NV by M 


pseudoinverse is simply, 
Equation: 


* * —1 
At=atT [AAT | 


with the formula for the minimum weighted solution norm ||z|| is 


Equation: 


At =[WTw] 'aT/|A[wTw] ‘A™| 


For these three cases, either [link] or [link] can be directly calculated, but not 
both. However, they are equal so you simply use the one with the non-singular 
matrix to be inverted. The equality can be shown from an equivalent 
definition [link] of the pseudo-inverse given in terms of a limit by 

Equation: 


* -1 * * * = 
At ni [AT A+ er A?’ te AT [AAT Me er 
6-0 6-0 


For the other 6 cases, SVD or other approaches must be used. Some properties 
[link], [link] are: 


At 
[A*] 

- [A*A]” = AtA*t 
eT = 1/d for A # O else 1° = 6 
-At=(a‘A]‘A=a"[aa’]" 
© A’ = A*’AA*+=A*TAA®™ 


It is informative to consider the range and null spaces [link] of A and At 
R(A) = R(AA*) = R(AA’) 

R(At)=R(A‘) = R(ATA) = R(A‘A) 

¢ R(I-AA*+)=N(AAt)=N(A a N (At) = R(A)* 
R(I—-A*A)=N(AtA)=N(A)=R(A a 


The Cases with Analytical Soluctions 


The four Penrose equations in [link] are remarkable in defining a unique 
pseudoinverse for any A with any shape, any rank, for any of the ten cases 


listed in Figure 1. However, only four cases of the ten have analytical 
solutions (actually, all do if you use SVD). 


e If A is case 1a, (Square and nonsingular), then 
Equation: 


At - Ao! 


e If A is case 2a or 2b, (over specified) then 
Equation: 


A+=[ATA] “AT 


e If A is case 3a, (under specified) then 
Equation: 


A*+=AT[AAT™] * 


Figure 2. Four Cases with Analytical Solutions 


Fortunately, most practical cases are one of these four but even then, it is 
generally faster and less error prone to use special techniques on the normal 
equations rather than directly calculating the inverse matrix. Note the matrices 
to be inverted above are all r by r (r is the rank) and nonsingular. In the other 
six cases from the ten in Figure 1, these would be singular, so alternate 
methods such as SVD must be used [link], [link], [link]. 


In addition to these four cases with “analytical” solutions, we can pose a more 
general problem by asking for an optimal approximation with a weighted 
norm [link] to emphasize or de-emphasize certain components or range of 
equations. 


e If A is case 2a or 2b, (over specified) then the weighted error 


pseudoinverse is 
Equation: 


* * —1 * * 
At= [AT wt wal AT ww 


e If A is case 3a, (under specified) then the weighted norm pseudoinverse 
is 
Equation: 


At =[wTw] ‘at [A [wTw] at 7 


Figure 3. Three Cases with Analytical Solutions and Weights 


These solutions to the weighted approxomation problem are useful in their 
own right but also serve as the foundation to the Iterative Reweighted Least 
Squares (IRLS) algorithm developed in the next chapter. 


Geometric interpretation and Least Squares Approximation 


A particularly useful application of the pseudo-inverse of a matrix is to 
various least squared error approximations [link], [link]. A geometric view of 
the derivation of the normal equations can be helpful. If b does not lie in the 
range space of A, an error vector is defined as the difference between Ax and 
b. A geometric picture of this vector makes it clear that for the length of € to 
be minimum, it must be orthogonal to the space spanned by the columns of A 
. This means that Ae = 0. If both sides of [link] are multiplied by A’, it is 
easy to see that the normal equations of [link] result in the error being 
orthogonal to the columns of A and, therefore its being minimal length. If b 
does lie in the range space of A, the solution of the normal equations gives 
the exact solution of [link] with no error. 


For cases 1b, Ic, 2c, 2d, 3a, 3b, and 3c, the homogeneous equation [link] has 
non-zero solutions. Any vector in the space spanned by these solutions (the 
null space of A.) does not contribute to the equation error € defined in [link] 
and, therefore, can be added to any particular generalized solution of [link] to 
give a family of solutions with the same approximation error. If the dimension 
of the null space of A is d, it is possible to find a unique generalized solution 
of [link] with d zero elements. The non-unique solution for these seven cases 
can be written in the form [link]. 

Equation: 


x= A*b+ [I — A‘Aly 


where y is an arbitrary vector. The first term is the minimum norm solution 
given by the Moore-Penrose pseudo-inverse A~ and the second is a 
contribution in the null space of A. For the minimum ||z||, the vector y = 0. 


Derivations 


To derive the necessary conditions for minimizing q in the overspecified case, 
we differentiate g = et € with respect to x and set that to zero. Starting with 
the error 

Equation: 


= ete = [Ax — b]* [Ax — b] = x’ ATAx— x™A™b— b™Ax+b"b 
Equation: 
g=x'A'’Ax-— 2x'’A'b+b'b 
and taking the gradient or derivative gives 
Equation: 
Vig =2ATAx— 2A Tb=0 
which are the normal equations in [link] and the pseudoinverse in [link] and 
[link]. 


If we start with the weighted error problem 
Equation: 


q = €' W™ We = [Ax — b]" WW [Ax — b] 


using the same steps as before gives the normal equations for the minimum 
weighted squared error as 


Equation: 
ATWTWAx = ATW" Wb 


and the pseudoinverse as 
Equation: 


x= [ATWTWA] “ATW TWb 


To derive the necessary conditions for minimizing the Euclidian norm ||z||, 
when there are few equations and many solutions to [link], we define a 
Lagrangian 

Equation: 


2 
L (x, 4) = || Wxl|} + wT (Ax — b) 
take the derivatives in respect to both x and yp and set them to zero. 
Equation: 
VL =2W'Wx+ATp=0 
and 
Equation: 
V,~% = Ax—b=0 
Solve these two equation simultaneously for x eliminating yz gives the 


pseudoinverse in [link] and [link] result. 
Equation: 


x= [WTw] 'A™|A[wTw] *A7| “'b 


Because the weighting matrices W are diagonal and real, multiplication and 
inversion is simple. These equations are used in the Iteratively Reweighted 
Least Squares (IRLS) algorithm described in the next chapter. 


Regularization 


To deal with measurement error and data noise, a process called 
“regularization” is sometimes used [link], [link], [link]. 


Least Squares Approximation with Constraints 


The solution of the overdetermined simultaneous equations is generally a least 
squared error approximation problem. A particularly interesting and useful 
variation on this problem adds inequality and/or equality constraints. This 
formulation has proven very powerful in solving the constrained least squares 
approximation part of FIR filter design [link]. The equality constraints can be 
taken into account by using Lagrange multipliers and the inequality 
constraints can use the Kuhn-Tucker conditions [link], [link], [link]. The 
iterative reweighted least squares (IRLS) algorithm described in the next 
chapter can be modified to give results which are an optimal constrained least 
p-power solution [link], [link], [link]. 


Conclusions 


There is remarkable structure and subtlety in the apparently simple problem of 
solving simultaneous equations and considerable insight can be gained from 
these finite dimensional problems. These notes have emphasized the J, norm 
but some other such as /,, and J; are also interesting. The use of sparsity 
[link] is particularly interesting as applied in Compressive Sensing [link], 
[link] and in the sparse FFT [link]. There are also interesting and important 
applications in infinite dimensions. One of particular interest is in signal 
analysis using wavelet basis functions [link]. The use of weighted error and 
weighted norm pseudoinverses provide a base for iterative reweighted least 
squares (IRLS) algorithms. 


Approximation with Other Norms and Error Measures 
Here we use a more general definition of norm in addition to L_2. In 
particular, we consider L_p. 


Approximation with Other Norms and Error Measures 


Most of the discussion about the approximate solutions to Ax = b are 
about the result of minimizing the /2 equation error || Az — b||, and/or the 
ly norm of the solution ||x||, because in some cases that can be done by 
analytic formulas and also because the /2 norm has a energy interpretation. 
However, both the J; and the /,,.{link] have well known applications that 
are important [link], [link] and the more general [,, error is remarkably 
flexible [link], [link]. Donoho has shown [link] that J; optimization gives 
essentially the same sparsity as the true sparsity measure in lo. 


In some cases, one uses a different norm for the minimization of the 
equation error than the one for minimization of the solution norm. And in 
other cases, one minimizes a weighted error to emphasize some equations 
relative to others [link]. A modification allows minimizing according to one 
norm for one set of equations and another for a different set. A more 
general error measure than a norm can be used which used a polynomial 
error [link] which does not satisfy the scaling requirement of a norm, but is 
convex. One could even use the so-called /,, norm for 1 > p > 0 which is 
not even convex but is an interesting tool for obtaining sparse solutions. 


-15 -1 -9.5 9 0.5 1 1.46 


Different /,, norms: p = .2, 1, 2, 10. 


Note from the figure how the /19 norm puts a large penalty on large errors. 
This gives a Chebyshev-like solution. The /g.2 norm puts a large penalty on 
small errors making them tend to zero. This (and the /; norm) give a sparse 
solution. 


The L, Norm Approximation 


The IRLS (iterative reweighted least squares) algorithm allows an iterative 
algorithm to be built from the analytical solutions of the weighted least 
squares with an iterative reweighting to converge to the optimal l, 
approximation [link]. 


The Overdetermined System with more Equations than Unknowns 


If one poses the /,, approximation problem in solving an overdetermined set 
of equations (case 2 from Chapter 3), it comes from defining the equation 
error vector 

Equation: 


e-— Ax—b 


and minimizing the p-norm 


Equation: 
1/p 
llell, = (> ef) 


or 
Equation: 


llell2 = 5 lenl? 


neither of which can we minimize easily. However, we do have formulas 
[link] to find the minimum of the weighted squared error 
Equation: 


\|Wel|z =) walen|” 


one of which is derived in [link], equation [link] and is 
Equation: 


x= {[ATWTWA] ATWTWb 


where W is a diagonal matrix of the error weights, w,,. From this, we 
propose the iterative reweighted least squared (IRLS) error algorithm which 
starts with unity weighting, W = I, solves for an initial x with [link], 
calculates a new error from{[link], which is then used to set a new weighting 
matrix W 

Equation: 


W = diag(w,)? >” 


to be used in the next iteration of [link]. Using this, we find a new solution 
x and repeat until convergence (if it happens!). 


This core idea has been repeatedly proposed and developed in different 
application areas over the past 50 years with a variety of success [link]. 
Used in this basic form, it reliably converges for 2 < p < 3. In 1990, a 
modification was made to partially update the solution each iteration with 
Equation: 


x(k) = gx (k) + (1 — q)x(k — 1) 


where X is the new weighted least squares solution of [link] which is used 
to partially update the previous value x(k — 1) using a convergence up-date 
factor 0 < q < 1 which gave convergence over a larger range of around 
1.5 < p < 5 but but it was slower. 


A second improvement showed that a specific up-date factor of 
Equation: 


significantly increased the speed of convergence. With this particular factor, 
the algorithm becomes a form of Newton's method which has quadratic 
convergence. 


A third modification applied homotopy [link], [link], [link], [link] by 
starting with a value for p which is equal to 2 and increasing it each 
iteration (or each few iterations) until it reached the desired value, or, in the 
case of p < 2, decrease it. This made a significant increase in both the 
range of p that allowed convergence and in the speed of calculations. Some 
of the history and details can be found applied to digital filter design in 
[link], [link]. 


A Matlab program that implements these ideas applied to our pseudoinverse 
problem with more equations than unknowns (case 2a) is: 


% M- 
file IRLS1.m to find the optimal solution to Ax=b 


% minimizing the L_p norm ||Ax-b|]|_p, using IRLS. 


% Newton iterative update of solution, x, for M 
> N. 


% For 2<p<infty, use homotopy parameter K = 1.01 
to 2 


% For O<p<2, use K = approx 0.7 - 0.9 
% csb 10/20/2012 

function x = IRLS1(A,b,p,K, KK) 

if nargin < 5, KK=10; end; 

if nargin < 4, K = 2; end; 


if nargin < 3, p = 10; end; 


pk = 2; % In 
tial homotopy value 
X = pinv(A)*b; % Ini 
tial L_2 solution 
E=[]; 
for k = 1:KK % Ite 
rate 

if p >= 2, pk = min([p, K*pk]); % Hom 


otopy change of p 


else pk = max([p, K*pk]); end 


e = A*x - b; % Err 
or vector 


w = abs(e).A((pk- 


2)/2); % Error weights for IRLS 
W = diag(w/sum(w) ); % Nor 
malize weight matrix 
WA = W*A; % app 
ly weights 
x1 = (WA'*WA)\ 
(WA'*W)*b; % weighted L_2 sol. 
q = 1/(pk- 
1); % Newton's paramete 
r 


if p> 2, xX = q*x1 + (1- 
q)*x; nn=p, % partial update for p>2 


else x = x1; nn=2; end % no 
partial update for p<2 


ee = norm(e,nn); E = [E ee]; % Err 
or at each iteration 


end 
plot(E) 


This can be modified to use different p's in different bands of equations or 
to use weighting only when the error exceeds a certain threshold to achieve 
a constrained LS approximation [link], [link], [link]. Our work was 
originally done in the context of filter design but others have done similar 
things in sparsity analysis [link], [link], [link]. 


This is presented as applied to the overdetermined system (Case 2a and 2b) 
but can also be applied to other cases. A particularly important application 
of this section is to the design of digital filters. 


The Underdetermined System with more Unknowns than Equations 


If one poses the /,, approximation problem in solving an underdetermined 
set of equations (case 3 from Chapter 3), it comes from defining the 


solution norm as 
1/p 
hie (Siew 4 


Equation: 
and finding x to minimizing this p-norm while satisfying Ax = b. 


It has been shown this is equivalent to solving a least weighted norm 
problem for specific weights. 
Equation: 


1/2 
I|z\|, = (> w(n)*|@ ot 


n 


The development follows the same arguments as in the previous section but 
using the formula [link], [link] derived in [link] 
Equation: 


x= [WTw] ‘AT|A[wTw] 'A7] “'p 


with the weights, w(n), being the diagonal of the matrix, W, in the 
iterative algorithm to give the minimum weighted solution norm in the 


same way as [link] gives the minimum weighted equation error. 


A Matlab program that implements these ideas applied to our pseudoinverse 
problem with more unknowns than equations (case 3a) is: 


% M- 
file IRLS2.m to find the optimal solution to Ax=b 


% minimizing the L_p norm ||x]||_p, using IRLS. 


% Newton iterative update of solution, x, for M 
< N. 


% For 2<p<infty, use homotopy parameter K = 1.01 
to 2 


% For O<p<2, use K = approx 0.7 to 0.9 
% csb 10/20/2012 

function x = IRLS2(A,b,p,K, KK) 

if nargin < 5, KK= 10; end; 

if nargin < 4, K = .8; end; 


if nargin < 3, p = 1.1; end; 


pk = 2; % Initial 
homotopy value 

X = pinv(A)*b; % Initial 
L_2 solution 

E={[]; 


for k = 1:KK 


if p >= 2, pk = min([p, K*pk]); % Homotopy 
update of p 


else pk = max([p, K*pk]); end 


W = diag(abs(x).4((2- 
pk)/2)+0.00001); % norm weights for IRLS 


AW = A*W; % applying 
new weights 
x1 = W*AW'* 
((AW*AW')\b); % Weighted L_2 solution 
q = 1/(pk- 
1); % Newton's parameter 


if p >= 2, x = q*x1i + (1- 
q)*x; nn=p; % Newton's partial update for p>2 


else x = x1; nn=1; end % no Newto 
n's partial update for p<2 


ee = norm(x,nn); E = [E ee]; % norm at 
each iteration 


end; 
plot(E) 


This approach is useful in sparse signal processing and for frame 
representation. 


The Chebyshev, Minimax, or [.. Appriximation 


The Chebyshev optimization problem minimizes the maximum error: 
Equation: 


Em =max |e (n)| 
n 


This is particularly important in filter design. The Remez exchange 
algorithm applied to filter design as the Parks-McClellan algorithm is very 
efficient [link]. An interesting result is the limit of an ||x||,, optimization as 


p — oo is the Chebyshev optimal solution. So, the Chebyshev optimal, the 
minimax optimal, and the L., optimal are all the same [link], [link]. 


A particularly powerful theorem which characterizes a solution to Ax = b 
is given by Cheney [link] in Chapter 2 of his book: 


e A Characterization Theorem: For an M by N real matrix, A with 
M > N, every minimax solution x is a minimax solution of an 
appropriate N + 1 subsystem of the M equations. This optimal 
minimax solution will have at least N + 1 equal magnitude errors and 
they will be larger than any of the errors of the other equations. 


This is a powerful statement saying an optimal minimax solution will have 
out of M/, at least N + 1 maximum magnitude errors and they are the 
minimum size possible. What this theorem doesn't state is which of the IZ 
equations are the VV + 1 appropriate ones. Cheney develops an algorithm 
based on this theorem which finds these equations and exactly calculates 
this optimal solution in a finite number of steps. He shows how this can be 
combined with the minimum ||e]|,, using a large p, to make an efficient 


solver for a minimax or Chebyshev solution. 


This theorem is similar to the Alternation Theorem [link] but more general 
and, therefore, somewhat more difficult to implement. 


The L; Approximation and Sparsity 


The sparsity optimization is to minimize the number of non-zero terms in a 
vector. A “pseudonorm", ||x||,, is sometimes used to denote a measure of 
sparsity. This is not convex, so is not really a norm but the convex (in the 
limit) norm ||x||, is close enough to the ||x||,) to give the same sparsity of 
solution [link]. Finding a sparse solution is not easy but interative 


reweighted least squares (IRLS) [link], [link], weighted norms [link], [link], 
and a somewhat recent result is called Basis Pursuit [link], [link] are 
possibilities. 


This approximation is often used with an underdetermined set of equations 
(Case 3a) to obtain a sparse solution x. 


Using the IRLS algorithm to minimize the /,, equation error often gives a 
sparse error if one exists. Using the algorithm in the illustrated Matlab 
program with p = 1.1 on the problem in Cheney [link] gives a zero error in 
equation 4 while using no larger p gives any zeros. 


Constructing the Operator (Design) 


Constructing the Operator (unfinished) 


Solving the third problem posed in the introduction to these notes is rather 
different from the other two. Here we want to find an operator or matrix 
that when multiplied by x gives b . Clearly a solution to this problem 
would not be unique as stated. In order to pose a better defined problem, we 
generally give a set or family of inputs x and the corresponding outputs b . 
If these families are independent, and if the number of them is the same as 
the size of the matrix, a unique matrix is defined and can be found by 
solving simultaneous equations. If a smaller number is given, the remaining 
degrees of freedom can be used to satisfy some other criterion. If a larger 
number is given, there is probably no exact solution and some 
approximation will be necessary. 


If the unknown operator matrix is of dimension by _ , then we take 
inputs x; for , each of dimension and the corresponding 

outputs by, each of dimension —_ and form the matrix equation: 
Equation: 


AX B 


where A isthe by unknown operator, X isthe by input matrix 
with columns which are the inputs x, and Bisthe by output 
matrix with columns by. The operator matrix is then determined by: 
Equation: 


A Bx! 


if the inputs are independent which means X is nonsingular. 


This problem can be posed so that there are more (perhaps many more) 
inputs and outputs than with a resulting equation error which can be 
minimized with some form of pseudoinverse. 


Linear regression can be put in this form. If our matrix equation is 
Equation: 


Ax b 


where A is a row vector of unknown weights and x is a column vector of 
known inputs, then b is a scaler inter product. If a seond experiment gives a 
second scaler inner product from a second column vector of known inputs, 
then we augment X to have two rows and b to be a length-2 row vector. 
This is continued for experiment to give [link] asailby row vector 
timesan by matrix whichequalsalby row vector. It this equation 
is transposed, it is in the form of [link] which can be approximately solved 
by the pesuedo inverse to give the unknown weights for the regression. 


Alternatively, the matrix may be constrained by structure to have less than 
degrees of freedom. It may be a cyclic convolution, a non cyclic 
convolution, a Toeplitz, a Hankel, or a Toeplitz plus Hankel matrix. 


A problem of this sort came up in research on designing efficient prime 
length fast Fourier transform (FFT) algorithms where x is the data and b is 
the FFT of x . The problem was to derive an operator that would make this 
calculation using the least amount of arithmetic. We solved it using a 
special formulation [link] and Matlab. 


